In this study, we will explore different ways that social relationships can be clustered in order to find categories of relationships. To find relationship categories, we will use a two-step process:

1. Use a dimension reduction technique to organize relationships into a two-dimensional space that will reveal clusters of relationships.
2. Use a clustering algorithm on the output of step 1 to assign relationships to clusters.

There are many different dimension reduction techniques and clustering algorithms. Each of these steps includes parameters and choices that can affect the final output. We will explore as much of this parameter space as possible.
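The two-step process can be sketched as follows. This is a minimal illustration using a random placeholder matrix in place of the actual ratings data, with PCA and k-means standing in for whichever dimension reduction and clustering methods are chosen.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ratings = rng.normal(size=(159, 30))  # placeholder for the 159 x 30 mean-ratings matrix

# Step 1: reduce the 30 rating dimensions to a two-dimensional space
embedding = PCA(n_components=2).fit_transform(ratings)

# Step 2: assign each relationship to a cluster in that space
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embedding)
```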

Social relationship ratings

This is a matrix of the average ratings of 159 social relationships on 30 dimensions from the literature. These data were collected as part of Study 3.

PCA Data Clustering

In Study 3, we used PCA as a dimensionality reduction technique to find the overarching dimensions that can represent social relationship knowledge. In the following analyses, we will try to find categories of social relationships using data-driven methods.

By visualizing the PCA relationship plots, we may be able to see some clusters that are present. Here, we will cluster the relationships using each relationship's scores on the first four components.

Determine the optimal number of clusters

We calculated the optimal number of clusters using silhouette scores for each cluster solution.
The recommended number of clusters is five for both k-means and hierarchical clustering.
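A sketch of how the silhouette-based selection can be done with scikit-learn, again using a random placeholder in place of the four-component PCA scores; the candidate range of 2-10 clusters is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
scores = rng.normal(size=(159, 4))  # placeholder for the four-component PCA scores

def best_k(make_model, ks=range(2, 11)):
    # Fit each candidate solution and keep the k with the highest mean silhouette
    sil = {k: silhouette_score(scores, make_model(k).fit_predict(scores)) for k in ks}
    return max(sil, key=sil.get)

best = {
    "kmeans": best_k(lambda k: KMeans(n_clusters=k, n_init=10, random_state=0)),
    "hierarchical": best_k(lambda k: AgglomerativeClustering(n_clusters=k)),
}
```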

K-Means clustering

Five cluster solution

Hierarchical clustering

Five cluster solution
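A minimal sketch of a five-cluster hierarchical solution with SciPy. Ward linkage is an assumption (the linkage method used in the analysis is not stated here), and the scores matrix is a random placeholder.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
scores = rng.normal(size=(159, 4))  # placeholder for the four-component PCA scores

Z = linkage(scores, method="ward")               # build the dendrogram (Ward is assumed)
labels = fcluster(Z, t=5, criterion="maxclust")  # cut the tree into five clusters
```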


UMAP

Uniform manifold approximation and projection (UMAP) is another dimension reduction technique. Whereas t-SNE (discussed in the Supplementary Analyses) conducts dimension reduction while aiming to retain the local structure of the data, UMAP aims to retain some of the global structure as well.

There are three parameters to consider when using UMAP:

  • Nearest neighbor
    • The number of approximate nearest neighbors used to construct the initial high-dimensional graph
    • Effectively, this parameter controls how UMAP balances the local versus global structure of the original data
    • Dependent on the size of your data (smaller values are needed for small numbers of data points)
  • Minimum Distance
    • Minimum distance between points in low-dimensional space
    • Effectively, this parameter controls how tightly UMAP clumps points together
  • Distance function
    • The toolbox we are using will allow us to use a variety of distance functions

For the distance function, we will use the most popular choice in the field, Euclidean distance. For the other two parameters, we will explore the results of tweaking their values.

More information on UMAP can be found here:

Determine the nearest neighbor and minimum distance

The nearest neighbor value is the most important parameter, as it determines the amount of local versus global information to retain. A lower value will capture more local structure and therefore produce results more similar to t-SNE. A higher value will capture more global structure.

We will explore UMAP results across a range of parameters. It is important to note that the nearest neighbor parameter is dependent on the size of your data, so we will use values from a small one (2) up to roughly a quarter of the data (40).

The columns and rows of the above figure indicate the values of the nearest neighbor and the minimum distance parameters, respectively. The advantage of UMAP over t-SNE is that it can retain more of the global structure of the original data, so we will use a moderate nearest neighbor value. We also want to be able to cluster relationships, which calls for a smaller minimum distance. Therefore, we will use a nearest neighbor of 5 and a minimum distance of 0.1 for the clustering analysis.

Determine the optimal number of clusters


Both k-means clustering and hierarchical clustering indicate that two clusters would be optimal. However, this seems unintuitive, so we will go with the second-best data-driven solution, six clusters.

K-Means clustering


Note: Although “dimensions” have been output, along which the relationships are plotted, these dimensions are difficult to interpret. This is an inherent feature of UMAP and t-SNE, since they sacrifice global structure in order to capture more of the original local structure.

Hierarchical clustering


Supplementary Analyses

t-SNE

t-SNE (t-distributed stochastic neighbor embedding) is a dimensionality reduction technique. Whereas PCA attempts to organize data so as to explain the most variance and capture the most global structure, t-SNE focuses on capturing the local structure of the data. In this analysis, we will run t-SNE on the original data of 159 relationships rated on 30 dimensions. We will then use clustering algorithms to label the results and provide us with a categorization of the social relationships.

Determine the perplexity parameter

Perplexity is a tunable parameter of t-SNE. It controls how t-SNE balances the local versus global structure of the data. Perplexity values are usually between 5 and 50, where a low perplexity focuses on more local structure and a higher perplexity focuses on more global structure. More information on perplexity and t-SNE can be found on the creator’s official webpage.
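A sketch of a perplexity sweep using scikit-learn's TSNE implementation (an assumption; the original analysis may use a different toolbox), with a random placeholder in place of the ratings matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
ratings = rng.normal(size=(159, 30))  # placeholder for the 159 x 30 ratings matrix

# Embed the data at several perplexity values within the usual 5-50 range
embeddings = {p: TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(ratings)
              for p in (5, 15, 30, 50)}
```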


It seems that a perplexity of 5 gives us distinct clusters. Since we are interested in creating relationship categories, we can ignore the more global structure of the data (PCA does a better job of capturing it) and focus on creating distinct clusters of relationships.

Determine the optimal number of clusters


The recommended number of clusters is six for both k-means and hierarchical clustering.

K-Means clustering

Hierarchical clustering